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Abstract 

Background: A convergence of high-throughput sequencing and computational power is transforming biology 
into information science. Despite these technological advances, converting bits and bytes of sequence information 
into meaningful insights remains a challenging enterprise. Biological systems operate on multiple hierarchical levels 
from genomes to biomes. Holistic understanding of biological systems requires agile software tools that permit 
comparative analyses across multiple information levels (DNA, RNA, protein, and metabolites) to identify emergent 
properties, diagnose system states, or predict responses to environmental change. 

Results: Here we adopt the Meta Pathways annotation and analysis pipeline and Pathway Tools to construct 
environmental pathway/genome databases (ePGDBs) that describe microbial community metabolism using MetaCyc, a 
highly curated database of metabolic pathways and components covering all domains of life. We evaluate Pathway Tools' 
performance on three datasets with different complexity and coding potential, including simulated metagenomes, a 
symbiotic system, and the Hawaii Ocean Time-series. We define accuracy and sensitivity relationships between read 
length, coverage and pathway recovery and evaluate the impact of taxonomic pruning on ePGDB construction and 
interpretation. Resulting ePGDBs provide interactive metabolic maps, predict emergent metabolic pathways associated 
with biosynthesis and energy production and differentiate between genomic potential and phenotypic expression 
across defined environmental gradients. 

Conclusions: This multi-tiered analysis provides the user community with specific operating guidelines, performance 
metrics and prediction hazards for more reliable ePGDB construction and interpretation. Moreover, it demonstrates the 
power of Pathway Tools in predicting metabolic interactions in natural and engineered ecosystems. 



Background 

Community interactions between uncultivated microor- 
ganisms give rise to dynamic metabolic networks integral 
to ecosystem function and global scale biogeochemical 
cycles [1]. Metagenomics bridges the "cultivation gap" 
through plurality or single-cell sequencing by providing 
direct and quantitative insight into microbial community 
structure and function [2,3]. Although, new technologies 
are rapidly expanding our capacity to chart microbial 
sequence space, persistent computational and analytical 
bottlenecks impede comparative analyses across multiple 
information levels (DNA, RNA, protein and metabolites) 
[4,5]. This in turn limits our ability to convert the genetic 
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potential and phenotypic expression of microbial com- 
munities into predictive insights and technological or 
therapeutic innovations. 

Functional genes operate within the structure of meta- 
bolic pathways and reactions that define metabolic 
networks. Despite this fact, few metagenomic studies 
use pathway-centric approaches to predict microbial 
community interaction networks based on known 
biochemical rules. Recently, algorithms for pathway 
prediction and metabolic flux have been developed for 
environmental sequence information including the Human 
Microbiome Project Unified Metabolic Analysis Network 
(HUMAnN) and Predicted Relative Metabolic Turnover 
(PRMT). HUMAnN uses an integer optimization 
algorithm that conservatively computes a parsimonious 
minimum set of reactions along KEGG pathways based on 
pathway presence, absence or completion [6,7]. PRMT 
infers metabolic flux based on normalized enzyme activity 
counts mapped to KEGG pathways across multiple meta- 
genomes [8]. Because KEGG pathways are coarse and do 
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not discriminate between pathway variants, both modes of 
analysis have limited metabolic resolution [9]. Moreover, 
neither HUMAnN nor PRMT provides a coherent 
structure for exploring and interpreting predicted KEGG 
pathways. 

One alternative to HUMAnN and PRMT is Pathway 
Tools, a production-quality software environment sup- 
porting metabolic inference and flux balance analysis 
based on the MetaCyc database of metabolic pathways 
and enzymes representing all domains of life [10-13]. 
Unlike KEGG or SEED subsystems, MetaCyc emphasizes 
smaller, evolutionarily conserved or co-regulated units of 
metabolism and contains the largest collection (over 2000) 
of experimentally validated metabolic pathways. Extensively 
commented pathway descriptions, literature citations, and 
enzyme properties combined within a pathway/genome 
database (PGDB) provide a coherent structure for explor- 
ing and interpreting predicted pathways. Although initially 
conceived for cellular organisms, recent development of 
the MetaPathways pipeline extends the PGDB concept to 
environmental sequence information enabling pathway- 
centric insights into microbial community structure and 
function [14,15]. 

Here we provide essential guidelines for generating 
and interpreting ePGDBs inspired by the multi-tiered 
structure of BioCyc [16] (Figure 1). We begin with genome 
and metagenome simulations to assess performance on 
datasets manifesting different read length, coverage and 
taxonomic diversity and we develop a weighted taxonomic 
distance to evaluate concordance between pathways 
predicted using environmental sequence information 
and reference pathways in the MetayCyc database. Given 
these metrics, we demonstrate Pathway Tools' power to 
predict emergent metabolism in simulated metagenomes 
and a previously characterized symbiotic system [17]. 
Finally, we generate ePGDBs using coupled metagenomic 
and metatranscriptomic datasets from the Hawaii Ocean 
Time-series (HOT) to compare and contrast genetic 
potential and phenotypic expression along defined 
environmental gradients in the ocean [18-20]. 

Results and discussion 

Performance considerations 

Environmental pathway/genome database (ePGDB) con- 
struction commences with the MetaPathways automated 
annotation pipeline using environmental sequence informa- 
tion as input (Materials and Methods). Resulting annota- 
tions are used by the PathoLogic algorithm implemented 
in Pathway Tools to predict metabolic pathways based on 
multiple criteria including proportion of pathways found, 
pathway specific enzymatic reactions, and purported 
taxon-specific pathway distributions. PathoLogic is known 
to perform well when compared to machine learning 
methods using the genomes of cellular organisms as input 



[21]. We previously reported PathoLogic s performance 
on combined and incomplete genomes using two simu- 
lated metagenomes (Siml and Sim2) derived from 10 
BioCyc tier-2 PGDBs manifesting different coverage and 
taxonomic diversity using MetaSim [14,22]. Simulations 
on increasing proportions of the total component genome 
length (Gm) showed that the performance of pathway 
recovery based on multiple metrics (F-measure, Matthews 
Correlation Coefficient, etc.) increased with sequence 
coverage and sample diversity nearing an asymptote at 
higher coverage (Figure 2a). This suggests that pathway 
prediction follows a collector s curve in which common 
core pathways accumulate in the early part of the curve 
followed by less common accessory pathways near the 
asymptote. 

To better constrain pathway recovery and performance 
in relation to ePGDB construction we compared results 
of MetaSim experiments using the Esherichia coli K12 
substr, MG1655 genome (basis of the EcoCyc database), 
Siml and Sim2, and a subsampled 25 m metagenome 
from HOT [19] (Additional file 1: Materials and Methods, 
Tables S1-S4 and Figure SI). Simulations were performed 
at progressively larger Gm coverage. Consistent with 
previous observations for Siml and Sim2, all experi- 
ments showed that pathway recovery percentage and 
performance sensitivity increased with sequence cover- 
age and sample diversity nearing an asymptote at higher 
coverage (Figure 2a-b). The absolute values of these pat- 
terns were sensitive to read length and likely reflected 
limits imposed by open reading frame prediction and 
BLAST/LAST-based annotation. In contrast, performance 
specificity was high (>85%) regardless of read length, 
coverage, or taxonomic diversity (Figure 2b). The rate of 
pathway recovery increased proportionally with increas- 
ing sample diversity at lower coverage values, as seen in 
the reduction of pathway recovery percentage between 
Siml, Sim2 and E. coli for long read (-700 bp) and be- 
tween HOT, Siml/2 and E, coli for short read (-160 bp) 
datasets. Additional performance metrics can be found 
in Additional file 1: Tables S5-S8. Because PathoLogic 
performance improves with increasing read length, 
coverage and sample diversity, sequencing platform selec- 
tion and use of assembled versus unassembled sequence in- 
formation should be considered when generating ePGDBs. 

When constructing PGDBs for individual genomes 
PathoLogic uses a process called taxonomic pruning to 
constrain pathway predictions within a specified taxonomic 
lineage by taking advantage of the curated 'taxonomic- 
range' associated with a given pathway. For example, if a 
pathway is found only in plants, it will be difficult to predict 
this pathway in the genome of a bacterial isolate when 
using taxonomic pruning. Such a process is intended to 
reduce false positive predictions in individual genomes 
[12]; However, microbial communities are composed of 
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Figure 1 A multi-tiered approach to ePGDB validation, (a) In the absence of highly curated and validated datasets, we took inspiration from 
the curation-tiered structure of available pathway/genome databases within the BioCyc family, (b/c) Through In silico simulated sequencing 
experiments on the E. coli K12 genome and two simulated metagenomes, we evaluated the performance of the Pathologic algorithm under 
changing sequence coverage and taxonomic distributions, (d) We reanalyzed the genomes of Candid atus Moronello endobia and Candidatus 
Trennbloyo princeps, two symbiotic taxa with reduced genomes, sharing a number of essential amino acid pathways, (e) Finally, we predicted 
pathways from a previously analyzed paired metagenomic and metatranscriptomic dataset from the Hawaii Ocean Time-series to validate on 
previously identified pathways and metabolic functions. 



diverse and largely uncultivated lineages whose combined 
metabolic potential and phenotypic expression must be 
considered both within and between individuals. Thus the 
taxonomic origin of environmental sequence information 
is more difficult to ascertain with the same degree of 
certainty as individual microbial genomes sourced from 
isolates or single-cells. Indeed, the true taxonomic range 
of many pathways remains to be constrained given the 
limited number of isolate genomes and the proclivity for 
horizontal gene transfer within microbial communities. 

In order to evaluate the impact of taxonomic pruning 
on pathway recovery from environmental sequence in- 
formation we constructed ePGDBs enabling or disabling 
taxon-specific pathway distributions (Additional file 1: 
Table S9). We ran PathoLogic on Siml/2 and 25 m HOT 
datasets with the 'Unclassified sequences' pruning thresh- 
old and without pruning. With taxonomic pruning enabled, 
long read and short read Siml ePGDBs exhibited a reduc- 
tion of 56% (206 compared to 604) and 61% (194 compared 
to 499) predicted pathways, respectively. Interestingly, the 
subsampled 25 m HOT dataset exhibited a 28% reduction 
(425 compared to 593) in pathway recovery with and with- 
out pruning suggesting that increased sample complexity 



can partially offset taxon specific sensitivity losses. In all 
cases, the pathways predicted with taxonomic pruning were 
a subset of pathways predicted without taxonomic pruning. 
Given these observations we posit that strict taxonomic 
pruning is inappropriate for ePGDB construction while rec- 
ognizing potential prediction hazards associated with path- 
ways predicted outside of their expected taxonomic range. 

To evaluate concordance between pathways predicted 
using environmental sequence information and reference 
pathways in the MetaCyc database we developed a weighted 
taxonomic distance (WTD) algorithm. The WTD algorithm 
measures the taxonomic distance between predicted coding 
DNA sequences (CDS), e.g., BLAST hits from the RefSeq 
database, and expected taxonomic range for each predicted 
pathway using the NCBI Taxonomy Database. The NCBI 
Taxonomy Database is hierarchically structured, and a path 
between the lowest common ancestor (LCA) of observed 
CDS annotations and each member of the expected 
taxonomic range in a pathway can be charted [23], 
where each path length represents some measure of 
taxonomic distance e.g. root, cellular organism, domain, 
phylum/division, class, order, family, genus, species. Steps 
on the path near the root of the hierarchy define greater 
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Figure 2 Analysis on in silico simulated sequencing experiments across different levels of coverage, sequencing lengths, and taxonomic 
distributions, (a) Predicted pathway recovery as a fraction of the total pathways predicted from the full genomes, (b) Sensitivity (circles) and precision 
(triangles) of predicted pathways of the in silico experiments using the pathways predicted on the full genomes as the gold standard. 



evolutionary distances than those near the tips. Thus the 
WTD algorithm weights steps on the connecting path by 
a factor of ^, where d is the depth position of a particular 
taxon in the hierarchy (Additional file 1: Supplementary 
Note 2). To distinguish between paths descending from 
the expected taxonomic range and those falling outside 
the expected taxonomic range, paths descending from an 
expected taxonomic range have a non-negative distance 
and paths outside this range have a negative distance. The 
WTD algorithm gives preference to non-negative dis- 
tances within expected taxonomic range(s), returning the 
minimum distance if found. Otherwise the maximum 
negative distance (i.e., closest to zero) is returned. 



When the WTD algorithm was applied to HOT data- 
sets, the taxonomic distribution of predicted pathways 
generally aligned with the expected taxonomic ranges 
of MetaCyc Pathways (Additional file 1: Figure S2). 
Predicted pathways were classified into four categories 
of taxonomic disagreement based on their WTD: 
"None" if the WTD was positive, and "Low", "Medium", 
and "High" if less than or equal to zero, based on distance 
quartiles. A pathway had "Low" taxonomic disagreement 
if in the upper two quartiles of negative distances (i.e., 
those closest to zero), "Medium" if in the second quartile, 
and "High" if in the bottom (i.e., most negative) quartile. 
Pathways with expected taxonomic ranges affiliated with 
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bacteria and archaea dominated the "None", "Low", and 
"Medium" disagreement classes, while pathways with 
expected taxonomic ranges affiliated with eukaryotes 
including "animals", "fungi", and "plants" comprised the 
majority of the "High" disagreement class (Additional 
file 1: Figure S3). While not excluded from downstream 
analysis, pathways with distances in the "High" disagree- 
ment class are more likely to represent false positives 
and should be interpreted with care. 

Distributed metabolic pathways 

Public good dynamics play an integral role in shaping 
microbial interactions through distributed networks of 
metabolite exchange [24]. Such networks promote 
increased fitness and resilience and may explain the 
underlying difficulty in cultivating most environmental 
microorganisms [25-27]. Because ePGDBs are constructed 
from environmental sequence information, predicted 
pathways are represented by multiple donor genotypes 
providing different levels of sequence coverage for each 
reaction. By comparing pathway recovery for individual 
reference genomes to pathway recovery for combinations 
of reference genomes, it becomes formally possible to use 
Pathway Tools to identify distributed metabolic pathways 
that emerge between multiple interacting partners. To 
test this hypothesis, we selected four Tier-2 reference 
genomes used in simulation experiments and constructed 
ePGDBs using all possible pair- wise genome combinations 
(Additional file 1: Table SIO). Thirty distributed pathways 
were identified in pair-wise genome combinations that 
were not predicted in PGDBs for individual cellular or- 
ganisms using set-difference analysis (Additional file 1: 
Table Sll). Common and unique reactions associated 
with distributed pathways could be identified as com- 
posite glyphs in the Pathway Tools genome browser 
(Additional file 1: Figure S4). 

To provide a real world example of distributed meta- 
bolic pathway prediction we selected a symbiotic system 
with known nutritional provisioning requirements. The 
reduced genomes of Candidatus Moranella endobia and 
Candidatus Tremblaya princeps (GenBank NC-0 15735 
and NC-0 15736), bacterial endosymbionts of the mealy- 
bug Planococcus citri have been previously described by 
McCutcheon and colleagues to distribute biosynthetic 
pathways for essential amino acids in a process known 
as "inter-pathway complementarity." Environmental PGDB 
construction using the combined Moranella and Trem- 
blaya genomes recovered 43 out of 44 reactions and all 9 
distributed amino acid biosynthesis pathways previously 
reported (Figure 3 and Additional file 1: Figure S5). Given 
these results, combinatorial ePGDB construction has 
enormous potential to predict distributed metaboUc 
pathways within defined microbial assemblages e.g.. 



co-cultures or more complex microbial communities in 
natural and engineered ecosystems. 

Comparative community metabolism 

To evaluate Pathway Tools' performance on complex 
microbial communities at different information levels we 
compared and contrasted coupled metagenome (DNA) 
and metatranscriptome (RNA) datasets from 25, 75, 
110 m (sunlit or euphotic) and 500 m (dark) ocean 
depth intervals from HOT [19]. A total of 1026 unique 
pathways from approximately 1.2 biUion base pairs of 
environmental sequence information were recovered 
spanning defined environmental gradients including 
luminosity, salinity, pressure, and oxygen concentra- 
tion (Additional file 1: Table S12). Of these pathways, 
840 met minimal quality control (QC) standards 
(Materials and Methods) and were used for subsequent 
set-difference analysis (Figure 4a). 

More than 600 pathways were shared in common be- 
tween the sunlit and dark ocean based on combined DNA 
and RNA datasets consistent with a conserved metabolic 
core (Figure 4b). A total of 14 unique pathways were 
predicted exclusively in sunlit samples with 20 pathways 
predicted at the intersection of 25, 75 and 110 m depth 
intervals (Figure 4b). More than 100 unique pathways 
were predicted for the 500 m compliment consistent with 
increased metabolic potential and niche-specialization with 
increasing depth (Figure 4b). Interestingly, the normalized 
proportion of genetic potential (DNA) versus expressed 
metabolic pathways (DNA/RNA) increased linearly be- 
tween 25, 75 and 110 m depth intervals (0.4, 0.7 and 1.2, 
respectively) before plateauing at 500 m (1.2) (Figure 4c). It 
remains to be determined if this trend reflects an asymp- 
tote or an inflection point in pathway expression co- 
varying as a function of metabolic status, environmental 
conditions or sample coverage and QC. 

A total of 30 pathways were identified exclusively in 
RNA datasets including 11 pathway variants (Figure 4c 
and Additional file 1: Figure S6). Expressed cholesterol 
degradation and tetrahydrobiopterin biosynthesis I were 
common to aU depth intervals. Unique expressed photo- 
respiration and glycolate degradation III pathways were 
recovered at 25 and 75 m, while ammonia oxidation III, 
methane oxidation to methanol II, and arginine bio- 
synthesis III were unique to 500 m (Additional file 1: 
Figure S6). More than 590 pathways were identified 
exclusively in DNA datasets, while 495 were shared in 
common between DNA and RNA datasets (Figure 4d). 
With respect to functional classes, unique Degradation, 
Biosynthesis and Energy-Metabolism pathways increased 
as a function of depth in DNA datasets (Additional file 1: 
Figure S7a). Within unique degradation classes a 
progression from amino acids to aromatic-compounds 
and secondary metabolites was observed between 
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Figure 3 Examples of emergent amino acid metabolism shared between the Moranella endobia and Tremblaya princeps genomes. This 
figure illustrates examples of emergent metabolic pathways predicted between symbiotic prokaryotes Condidotus Moronello endobio and 
Candidatus Tremblaya princeps. Enyzmes found in Moranella (red), Tremblaya (blue), or both taxa (purple) are highlighted in the pathway glyph 
diagrams, showing patterns of potentially emergent metabolism. A complete description of all amino acid pathways can be found in Additional 
file 1: Figure S3. 



25, 75, 110 and 500 m depth intervals. A similar 
progression was observed for a subset of Biosynthetic 
classes including polyamines, lipids, and cofactors and 
for Energy-Metabolism including Cl-compounds and 
fermentation (Additional file 1: Figure S7b). 

An evaluation of the 72 most abundant pathways re- 
covered from the combined datasets indicated that 53 
were both present and expressed at 25, 75, 110, and 500 m 
depth intervals. Moreover, several of the most abundant 
pathways including ammonium transport, Rubisco shunt, 
NADH to cytochrome electron transfer, pyruvate fermen- 
tation, denitrification, Calvin-Benson-Bassham cycle, cyst- 
eine biosynthesis I and arginine biosynthesis III exhibited 
depth-dependent trends in gene expression (Additional 
file 1: Figure S8). A number of abundant pathways com- 
mon to 25, 75, 110, and 500 m depth intervals in the 
DNA datasets were exclusively expressed in sunlit or dark 
ocean waters (Figure 5). In sunlit waters these included 
photosynthesis light reactions, hydrogen production VIII, 
flavonoid biosynthesis, cofactors including heme, vitamin 
B-complex (thiamin, adenosylcobalamin), and glutathione 
for oxidative stress (Figure 5). Below the euphotic zone, 
the 500 m depth interval exclusively expressed path- 
ways for ribitol, rhamnose, guanosine nucleotide, 2- 
methylcitrate, and threonine degradation as well as 
pathways for cofactor biosynthesis including phospho- 
pantothenate, menaquinol-8 (vitamin K), and coenzyme 
M and several carbohydrate and amino acid biosynthetic 
pathways including CMP-N-acetylneuraminate I, ADP-L- 



glycero-beta-D-manno-heptose and glycine biosynthesis 
IV (Figure 5). 

Consistent with previous reports, sunlit waters expressed 
many photosynthesis-related pathways including aerobic 
electron transfer, hydrogen production, and cofactors 
including ubiquinol, heme, vitamin B-complex (nicotinate, 
thiamine, cobalamin, tetrahydrofolate), chlorophyll a, and 
retinol biosynthesis [19,20] (Additional file 1: Figures S9 
and SIO). In addition to photosynthesis, 25 and 75 m 
depth intervals (upper euphotic) sets included pathways 
associated with degradation of plant metabolites including 
phytate, glucuronate, mannitol, chitin, xylose, arabinose, 
gallate, and quinolate. Other pathways of interest identi- 
fied in sunlit waters included organophosphate, urea, and 
aminobutyrate degradation, as well as pathways for 
conversion of the plant hormone indole-3 acetic acid 
and mercury detoxification. Below the euphotic zone, 
the 500 m depth interval expressed unique pathways for 
intra-aerobic nitrite reduction, dissimilatory nitrate 
reduction, the reductive monocarboxylic acid cycle, 
ammonia oxidation, and methane oxidation to methanol I 
(Additional file 1: Figure Sll). Thus, comparative ePGDB 
analysis using the combined DNA and RNA datasets 
differentiated between genomic potential and pheno- 
typic expression across defined environmental gradients 
in the ocean and revealed known and novel patterns of 
functional specialization with potential implications for 
nutrient and energy flow within sunlit and dark ocean 
waters. 
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Figure 4 Analysis of predicted patiiways from thie Hawaii Ocean Time-series, (a) A total of 1033 unique pathways were predicted from the 
HOT samples (Additional file 3), however only 840 unique pathways remained after all pathways in each sample with less than 10 ORFs were 
removed (Additional file 4). (b) After normalizing by total predicted ORFs (Additional files 5 and 6), a 4-way set analysis of these quality controlled 
(QC) pathways shows that the samples share a large core of common pathways, (c) Separating unique pathways within the DNA and RNA of 
each sample revealed that very few pathways were unique to the RNA fraction of each sample, (d) Finally, at set analysis of the unique DNA 
fraction (light colors), and pathways common to DNA and RNA from each sample (dark colors) found subsets of pathways unique to each fraction 
(Additional files 7 and 8). 



Pathway prediction hazards 

While the construction of ePGDBs promotes pathway- 
centric analysis of environmental sequence information, 
prediction hazards need to be considered for optimal 
interpretive power. One common hazard is the multiple 
mapping problem/ arising when an enzyme catalyzes 
conserved or promiscuous reaction steps across mul- 
tiple pathways or enzyme commission (EC) numbers 
representing classes with non-specific substrate activity. 



For example EC 3.2.1.21 represents a non-specific enzyme 
class for beta-D-glucosides, allowing for spurious prediction 
of specific carbohydrate degradation pathways. Moreover, 
PathoLogic has a preference for EC numbers over product 
descriptions that can further exacerbate false discovery 
associated with non-specific enzyme classes. Hazards 
manifesting themselves within pathway variants sharing 
a number of common or reversible reaction steps have 
previously been described by Caspi and colleagues in 
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Figure 5 Comparison of predicted genomic and transcriptomic pathiways witii unique expression in the 'sunlit' and 'daric' HOT samples. 

Sunlit metabolism was indicative of photosynthesis and aerobic metabolism including photosynthesis light reactions and hydrogen production. Dark 
metabolism had significantly more degradation pathways. 



the context of PGDB construction for cellular organ- 
isms [28]. For example, the tricarboxyUc acid cycle 
(TCA) cycle has at least eight pathway variants associated 
with different taxonomic groups and several incomplete 
or reversible forms that share multiple reactions steps. 



Pathologic has difficulty differentiating between TCA 
cycle variants when reversible pathway components are 
present even when a diagnostic step such as ATP-citrate 
lyase for the reductive TCA cycle is missing from the 
input data. A similar problem occurs when a regulatory 
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protein is used to provide evidence that a pathway exists 
even when catalytic pathway components are missing 
from the input. Given that we constructed ePGDBs 
without taxonomic pruning and that PathoLogic uses 
automated annotations from multiple taxonomic groups 
when predicting pathways from environmental sequence 
information, taxon specific pathways such as plant hor- 
mone biosynthesis or innate immunity can be predicted 
even when organisms known to encode such pathways are 
absent from the dataset. As described in the performance 
considerations section, WTD can be used to discern dif- 
ferences between the predicted and expected taxonomic 
range of pathways pointing to potential hazards prior to 
interpretation. Indeed, the extent to which these predicted 
pathways reflect previously unrecognized variants or 
prediction artifacts remains to be determined. More- 
over, this hazard has the potential to confound distrib- 
uted metabolic pathway identification when sequence 
coverage is low or microbial community composition is 
extremely uneven. Some examples of these hazards 
from the HOT analysis are provided in Additional file 1: 
Table S13. 

The identification of dissimilatory nitrate reduction 
(denitrification), intra-aerobic nitrite reduction and am- 
monia oxidation in the combined 500 m HOT DNA and 
RNA datasets provides a real world example of hazard 
navigation. Denitrification is a distributed form of energy 
metabolism resulting in the production of nitrogen gas 
in oxygen- deficient waters (<20 (iM O2 per kg) [29,30]. 
The first step in denitrification is nitrate reduction to 
nitrite. In the combined HOT DNA and RNA datasets the 
predicted pathway variant nitrate reduction IV included a 
subset of CDS transcripts for nitrate reductase gamma 
subunif (24 in DNA, 79 in RNA) while the predicted 
pathway variant nitrate reduction I included CDS tran- 
scripts for multiple nitrate reductase subunits (Figure 6). 
While CDS for nitrate reductase subunits originated from 
a number of different taxa including Alphaproteobacteria, 
Gammaproteobacteria, Nitrospira and Planctomycetes, 
435 out of 523 (83%) predicted nitrate reductase transcripts 
originated from Nitrospira and Planctomycetes consistent 
with a role in nitrite oxidation [31-34] (Figure 6). The 
second step in denitrification is nitrite reduction to nitric 
oxide. Within the DNA dataset both bacterial and archaeal 
CDS for nitrite reductase were recovered while transcripts 
originating from ammonia oxidizing archaea dominated the 
RNA dataset (Figure 6). Coding sequences/transcripts for 
downstream pathway components including nitric oxide 
reductase and nitrous oxide reductase were not detected, 
although CbbQ/NirQ/NorQ family regulators necessary for 
inorganic carbon fixation in the Calvin-Benson-Bassham 
cycle, nitrite and nitric oxide reduction were identified in 
DNA and RNA datasets [35] (Figure 6). Given that the 
mean oxygen concentration at 500 m is -120 (iM O2 per 



kg [18,20], these results are consistent with active water 
column nitrite and ammonia oxidation processes. Recent 
studies in the Eastern Tropical South Pacific OMZ ob- 
served changes in the frequency distribution of denitri- 
fication genes between free-living (0.2-1.6 (im) and 
particle-associated (>1.6 (im) size fractions, with nitric 
oxide reductase and nitrous oxide reductase encoding 
genes enriched on particles [36]. The extent to which 
denitrification or anammox processes partition between 
free-living and particle-associated microoganisms in the 
HOT water column remains to be determined. 

Conclusions 

While advances in high throughput sequencing tech- 
nologies are rapidly giving rise to tens of thousands of 
environmental datasets, the computational and analytic 
powers needed to organize, interpret and mobilize these 
datasets have lagged behind. Conventional BLAST-based 
annotation methods combined with gene-centric analyses 
tend to overlook the network properties of microbial 
communities driving ecological and biogeochemical inter- 
actions. We argue that pathway-centric analyses via the 
MetaPathways pipeline and Pathway Tools provides the 
scientific user community with an end-to-end solution for 
comparing ePGDBs constructed from environmental se- 
quence information revealing known and novel network 
properties. As with any automated analysis, this method 
is no replacement for manual curation. Indeed, we have 
highlighted specific instances where taxonomic range, 
idiosyncratic annotation, multifunctional enzymes, regula- 
tory functions, and reversible enzymatic forms predicted 
by Pathway Tools result in interpretive hazards that 
require expert knowledge to resolve. 

Continued development efforts are needed to improve 
on existing features and add new functionality to both the 
MetaPathways pipeline and Pathway Tools. Specifically, 
improved import features amenable to categorical metadata 
e.g., taxonomic origin, location, depth, etc., need to be 
integrated with Pathway Tools 'groups', a feature that 
enables users to integrate external data and group 
pathways and objects within Pathway Tools. The 'groups' 
feature in turn needs to be better integrated into the 
omics' viewer allowing for improved pathway navigation 
and page summaries within the Pathway Tools browser. 
Tooltip enhancements that summarize the categorical 
data mentioned above could further enhance the browsing 
experience. Current ePGDBs are constructed using 
concatenated CDS sequences and improved viewing 
features are needed that map coverage and noncoding 
sequence information onto complete contigs. Finally, 
the PathoLogic algorithm should be improved to 
incorporate the described prediction hazards and WTD 
into its calculations. Specifically, one can imagine tree- 
based algorithmic improvements to PathoLogic akin to 
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the WTD described here that integrate taxonomic infor- 
mation with enzyme or pathway directionaUty. 

Despite current Umitations, ePGDBs provide an inter- 
active and hoUstic data structure in which to investigate 
distributed metabolism and differentiate between micro- 
bial community metabolic potential and phenotypic ex- 
pression. Thus, ePGBDs provide a functional blueprint of 
microbial community metabolism that can be harnessed 
to engineer microbial consortia with defined emergent 
properties. These properties can in turn be transferred to 
industrial strains or modeled using MetaFlux to improve 
process performance [13]. Although the set-difference and 
visual inspection methods used to identify distributed 
metabolic pathways described here do not scale for big 
datasets, future algorithmic improvements will enable 
comparisons of reference genomes and metagenomes in 
large numbers. Indeed, splitting the proverbial "reaction 
arrows" for each step in a given metabolic pathway into 
taxonomic bins provides a basis for integer optimization 
methods that compute "distribution" scores and a baseline 
for monitoring changes in the reaction network associated 
with environmental change or even human health status. 
Looking forward, we envision an open source collection of 
ePGDBs, called EngCyc analogous to BioCyc [16], which 
can be queried and compared online revealing the network 
properties of microbial communities in natural and 
engineered ecosystems on a truly global scale. 



Methods 

Metabolic pathway analysis 

Environmental PGDBs were constructed from public 
datsets using MetaPathways (http://github.com/hallamlab/ 
MetaPathways/) [14] with default parameter settings: open 
reading frame (ORF) detection by Prodigal (minimum 
length 60 amino acids), functional annotation by BLAST 
(e- value le-5, blast-score ratio 0.4) against protein data- 
bases KEGG [37], COG [38], MetaCyc [11] (version 16.0), 
and RefSeq [39] (Downloaded August 2012), and pathway 
prediction via the PathoLogic algorithm with taxonomic 
pruning disabled. Predicted pathways and associated 
annotated CDS sequences were extracted from created 
ePGDBs using the utility script extract_pathway_table_- 
from_pgdb.pl included with MetaPathways. 

Pathway prediction on simulated data 

Simulated sequencing experiments were performed using 
MetaSim [22] with the parameter settings: Long read: 
clone size 36000 bp, Gaussian error, mean read length 
700 bp, standard deviation 100 bp; Short read: Gaussian 
error, mean 160 bp, standard deviation 40 bp) against 
the E. coli K12 MG1655 complete nucleotide genome 
(GenBank: NC_000913) at a series of fractional levels 
(1/32, 1/16, 1/8, 1/4, 1/2, 1/1) of the total combined 
length of starting component genomes (Gm). Pathways 
were predicted using the MetaPathways pipeline, as 
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described above, against each of the resulting sequence 
sets (Additional file 1: Tables S3 and S4). A classification 
performance analysis was performed; True positives 
(TP) were pathways found in both the simulated sample 
pathways (test set) and the complete gold standard 
E. coli genome. True negatives (TN) were pathways not 
predicted in the test set or gold standard. False positives 
(FP) were pathways found in the test set but not in the 
gold standard. Finally, false negatives (FN) were path- 
ways found in the gold standard but not in the test set. 
Multiple summary statistics for the resulting confusion 
tables (Sensitivity (Recall), Specificity, Precision, Accuracy, 
F-measure, and Matthews Correlation Coefficient (MCC)) 
were calculated. A summary of these performance sta- 
tistics is provided in the supplement (Additional file 1: 
Note SI: A Note on Confusion Table Statistics'). 

Simulated metagenomes: Sim1, Sim2 

Simulated sequencing experiments of metagenomes Siml 
and Sim2 were generated and analyzed as described above 
for E. colL To minimize name-mapping problems, we used 
prokaryotic genomes from the tier-2 BioCyc database 
collection [21]. The Siml metagenome was composed 
of ten tier-2 BioCyc genomes (Additional file 1: Table S2) 
in equal copy number, while Sim2 was composed of the 
Caulobacter cresentus N A 1000 genome in 20-fold excess 
relative to other genomes (Additional file 1: Figure SI). A 
classification performance analysis was performed as de- 
scribed above with the set of 646 pathways predicted from 
the complete tier-2 genomes used to derive Siml and 
Sim2 representing the gold standard (Additional file 1: 
Tables S5-S8). 

Simulated metagenomes: HOT (25 m) 

A 25 m metagenome from the Hawaii ocean time series 
was sub-sampled with replacement to different fractional 
levels (1/20, 1/10, 3/20, 1/5, 2/5, 3/5, 4/5, and 1/1) and 
pathways were predicted as described above. Similarly, a 
classification performance analysis was performed with 
the set of 864 pathways predicted from the complete 
454 run representing the gold standard (Additional file 1: 
Tables S7 and SB). 

Taxonomic pruning experiments 

The fuU-Gm simulated sequencing samples for Siml and 
Sim2, both short and long read lengths, and the fuU-Gm 
HOT (25 m) sample, had their pathways predicted with 
the above method, but with taxonomic pruning enabled 
using the taxonomic lineage parameter set to "Unclassi- 
fied sequences". The number of predicted pathways were 
tabulated and compared with the pathways previously 
predicted with taxonomic pruning disabled. As simple 
set analysis showed that within a sample the pruned path- 
ways were a strict subset of the "no-pruning" ones, and 



the reduction in pathways was calculated (Additional 
file 1: Table S9). 

Weighted taxonomic distance 

For each predicted pathway in the HOT dataset, a 
weighted taxonomic distance (WTD) distance was 
calculated using the WTD algorithm (Additional file 1: 
Supplementary Note 2). First, the lowest common ancestor 
algorithm (LCA) was applied to a pathways RefSeq CDS 
sequences. The WTD algorithm calculates a weighted dis- 
tance D between the observed LCA taxonomy Xobs and the 
pathways expected taxonomic range(s) Xexp^ jj^iMetaCyc) 
(p)y where XR^^^^^^^^^^ip) is the set of taxonomic range(s) for 
a given pathway p on the NCBI Taxonomy Database 
hierarchy. 

This WTD algorithm takes as input p and Xobs> and 
calculates a weighted taxonomic distance for each Xexp 
on nodes in the connecting path P{Xexpy ^obs)> as 

where ea,b is an edge between nodes a and b in the 
path and d(a) is the depth of node a. If x^xp descends 
from the expected taxonomic range Xobs> then the 
WTD is assigned a positive value and WTD for paths 
descending outside this range are assigned a negative 
value. After calculating the WTDs for all pairs Xexp> Xobs> 
the WTD algorithm first attempts to return the mini- 
mum non-negative distance e.g., WTD corresponding to 
the closest Xexp where Xobs is a descendant of Xexp> and 
returns the maximum negative score e.g., closest to zero if 
all observed and expected taxonomies diverge. For each 
dataset, predicted pathways were assigned to a "Disagree- 
ment Class" based on the following criteria: (i) pathways 
with positive WTD were given the "None" class, (ii) path- 
ways with distances greater than the median of negative 
WTDs were given the "Low" class, (iii) pathways within 
the 2^*^ quartile were given the "Medium" class, and 
(iv) pathways in the lower quartile were given the 
"High" disagreement class (Additional file 1: Figure S2). 
The expected taxonomic ranges of each pathway where 
then collapsed into the higher taxonomic levels: "root", 
"cellular organisms", "prokaryotes", "archaea", "bacteria", 
"eukaryotes", "animals", "fungi", "plants", and "other", as 
defined on the NCBI Taxonomy Database hierarchy and 
pathway frequencies and disagreement classes were sum- 
marized for each sample (Additional file 1: Figure S3). 

Distributed metabolic pathway prediction 

Four genomes of similar size and complexity from the 
tier-2 dataset were combined in a pairwise manner: 
Aurantimonas manganoxydans SI85-9A (GenBank: NZ_ 
AAPJOOOOOOOO.l), Bacillus subtilis subtilis 168 (GenBank: 
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AL009126.3), Caulobacter crescentus NAIOOO (GenBank: 
CP001340.1), and Helicobacter pylori 26695 (GenBank: 
AE000511.1), abbreviated by the first character of their 
proper names, A, B, C, and H, respectively. The six pair- 
wise and four original genomes were analyzed as de- 
scribed above for E. coli (Additional file 1: Table SIO). 
Pathways predicted in the combined PGDBs were con- 
sidered candidates for distributed metabolism if they 
were absent from PGDBs for individual genomes (i.e., 
found in A and B combined, but not in either A or B in- 
dividually) (Additional file 1: Table Sll and Additional 
file 2). Candidate pathways were manually inspected 
and deemed plausible' if there was sufficient coverage, i.e., 
75% of reactions in a pathway had associated CDS 
sequences from both taxa (Additional file 1: Figure S4). 

Similarly, the Candidatus Moranella endobia and 
Candidatus Tremblaya princeps genomes (GenBank: 
NC-015735 and NC-015736) were downloaded from 
NCBI and analyzed as described above for E, coli. 
Resulting PGDBs for individual and combined genomes 
were manually inspected for amino acid biosynthetic 
pathways described in McCutcheon and Dohlen [17] 
(Additional file 1: Figure S5). 

Hawaii ocean time-series 

Unassembled metagenomic and transcriptomic pyrose- 
quences from the Hawaii Ocean Time-series (10 m, 75 m, 
110 m, and 500 m) were obtained from the NCBI Sequence 
Read Archive (SRA Accession: SRX007372, SRX007369, 
SRX007370, SRX007371, SRX016893, SRX016897, SRX 
156384, SRX156385) and run through the MetaPathways 
pipeline using default settings (Additional file 3). To avoid 
spurious predictions, only pathways with more than ten 
mapped CDS sequences in an individual sample were used 
in downstream analysis. The pathways with nine or fewer 
mapped CDS sequences represent the lower quartile of 
pathway annotations (Figure 4a, Additional file 4). Pathway 
CDS counts for each sample were normalized to the total 
number of unannotated ORFs in each dataset. Count data 
was then converted to percentages providing relative ORF 
abundance for each pathway (Additional file 5), along with 
their weighted taxonomic distances and sample-wise 
disagreement classes (Additional file 6). Relative CDS 
abundance of the top-40 pathways from DNA and RNA 
datasets were compared (Additional file 1: Figure S8). In 
addition, pathways predicted in the DNA and RNA 
datasets were compared at each depth interval to provide 
sample-wise fractions for each depth e.g., DNA-only, 
DNA-RNA, and RNA-only (Figure 4c). Given the 
small number of pathways in the RNA-only sets no 
set-difference analysis was needed (Additional file 1: 
Figure S6). The DNA-only sets were declined and tabu- 
lated at various levels of the MetaCyc pathway hierarchy 
(Additional file 1: Figure S7). A final four- way set analysis 



was performed on the DNA-only and DNA-RNA path- 
ways at each depth (Figure 4d, Additional files 7 and 8). 
DNA-RNA set-difference subsets with more than 5 
predicted pathways were compared in detail (Additional 
file 1: Figures S9-S14). All data transformations, set opera- 
tions, and comparisons were performed in the R statistical 
environment (http://www.r-project.org), and visualized 
using the ggplot graphical package (http://ggplot2.org) 
and d3.js graphical library (http://d3js.org/). 

Availability of supporting data 

The ten full-length genomes used to create simulated 
metagenomes can be downloaded from GenBank under 
accession numbers AE008687-AE008690, NZ_AAPJOO 
000000.1, AL009126.3, AE005673, CP001340.1, AE000511.1, 
AE000516, AL123456, NC_007604.1, AE003852, and 
AE003853. 

The symbiotic Candidatus Moranella endobia and Can- 
didatus Tremblaya princeps genomes can be downloaded 
from GenBank under accession numbers NC-015735 and 
NC-015736). The Hawaii Ocean Time series datasets can 
be downloaded from the NCBI Sequence Read Archive 
under accession numbers SRX007372, SRX007369, 
SRX007370, SRX007371, SRX016893, SRX016897, SRX 
156384, SRX156385. 

Additional files 



Additional file 1: Supplementary notes, figures, and tables. 

Additional file 2: Summary of candidate pathways that are 
potentially distributed by set-difference analysis. 

Additional file 3: Summary table of 1033 pre-QC predicted pathways 
and CDS counts for the Hawaii Ocean Time-series samples. 

Additional file 4: Summary table of the 840 post-QC predicted 
pathways and CDS counts for the Hawaii Ocean Time-series 
samples. 

Additional file 5: Summary table of the 840 post-QC predicted 
pathways and normalized CDS counts for the Hawaii Ocean 
Time-series samples with taxonomic disagreement class highlighted. 

Additional file 6: Summary table of the 840 post-QC predicted 
pathways and normalized CDS counts for the Hawaii Ocean 
Time-series samples with observed LCA taxonomies, expected 
taxonomic ranges, calculated weighted taxonomic distance, and 
taxonomic disagreement class. 

Additional file 7: Summary table of normalized CDS counts for the 
593 DNA fraction pathways of samples from the Hawaii Ocean 
Time-series. 

Additional file 8: Summary table of normalized CDS counts for the 
495 pathways common to DNA and RNA samples from the Hawaii 

Ocean Time-series. 
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